Artificial Intelligence and Machine Learning

Ensemble Techniques

Company "Visit with Us"

Project 4: Travel Package Purchase Prediction

By Dario H. Romero


Table of Contents

Project Description¶

  • Background
  • Objective
  • Data Dictionary
  • Importing required packages

Data Ingestion¶

  • Preliminary Drops
  • Data Imputation of categorical variables
  • Feature Engineering - Encoding Categorical Variables

Exploratory Data Analysis (EDA)¶

  • Univariate Analysis
  • Correlation Matrix
  • Bivariate Analysis
  • Multivariate Analysis

Insights based on EDA¶

  • Customer Profile
  • Other exploratory deep dive
  • Key meaningful observations on the relationship between variables

Data Pre-processing¶

  • Preparing the data for analysis
  • Outlier Detection (treatment, if needed)
  • Missing value Treatment - Impute with KNN
  • Feature Engineering
  • Prepare data for modelling and check the split

Model Building, Evaluation, and Prediction¶

Model building - Bagging¶

  • Build Bagging classifier, Random Forest, and Decision Tree
  • Comments on model performance

Model performance evaluation and improvement - Bagging¶

  • Right metric for model performance evaluation (and why?)
  • Comment on model performance after tuning the Decision Tree, Bagging, and Random Forest classifiers

Model building - Boosting¶

  • Build Adaboost, GradientBoost, and XGBoost classifiers
  • Comment on model performance

Model performance evaluation and improvement - Boosting¶

  • Right metric for model performance evaluation (and why?)
  • Comment on model performance after tuning the AdaBoost, Gradient Boosting, and XGB classifiers on the appropriate metric

Actionable Insights & Recommendations¶

  • Model performance comparison on various metrics
  • Key takeaways
  • Advice to grow the business

Project Background

  • You are a Data Scientist for a tourism company named "Visit with Us". The Policy Maker of the company wants to establish a viable business model to expand the customer base.
  • A viable business model is a central concept that helps you understand the existing ways of doing business and how to change them for the benefit of the tourism sector.
  • One way to expand the customer base is to introduce a new offering of packages.
  • Currently, the company offers 5 types of packages - Basic, Standard, Deluxe, Super Deluxe, and King. Looking at last year's data, we observed that 18% of the customers purchased a package.
  • However, it was difficult to identify potential customers because customers were contacted at random, without looking at the available information.
  • The company is now planning to launch a new product, the Wellness Tourism Package. Wellness Tourism is defined as travel that allows the traveler to maintain, enhance, or kick-start a healthy lifestyle, and support or increase one's sense of well-being.
  • This time, the company wants to harness the available data on existing and potential customers to target the right customers.
  • As a Data Scientist at the "Visit with Us" travel company, you have to analyze the customers' data to provide recommendations to the Policy Maker and build a model that predicts which potential customers will purchase the newly introduced travel package. The model will make predictions before a customer is contacted.

Objective

  • To predict which customers are most likely to purchase the newly introduced travel package.

Data Dictionary

Customer details:¶

  • CustomerID: Unique customer ID
  • ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
  • Age: Age of customer
  • TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
  • CityTier: City tier of the city the customer lives in; it depends on the development of the city, its population, facilities, and living standards. The categories are ordered: Tier 1 > Tier 2 > Tier 3.
  • DurationOfPitch: Duration of the pitch by a salesperson to the customer
  • Occupation: Occupation of customer
  • Gender: Gender of customer
  • NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
  • NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
  • ProductPitched: Product pitched by the salesperson
  • PreferredPropertyStar: Preferred hotel property rating by customer
  • MaritalStatus: Marital status of customer
  • NumberOfTrips: Average number of trips in a year by customer
  • Passport: The customer has a passport or not (0: No, 1: Yes)
  • PitchSatisfactionScore: Sales pitch satisfaction score
  • OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
  • NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
  • Designation: Designation of the customer in the current organization
  • MonthlyIncome: Gross monthly income of the customer
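Since CityTier is an ordered category (Tier 1 > Tier 2 > Tier 3), pandas can capture that ordering explicitly with an ordered CategoricalDtype. A minimal sketch on toy data (the notebook below instead keeps CityTier as plain integers, so this is only an illustration):

```python
import pandas as pd

# Hypothetical example frame; the real notebook loads Tourism.xlsx
demo = pd.DataFrame({"CityTier": [1, 3, 2, 1, 3]})

# Ordered categorical preserving Tier 1 > Tier 2 > Tier 3
# (listed from least to most developed, so min() is Tier 3)
tier_dtype = pd.CategoricalDtype(categories=[3, 2, 1], ordered=True)
demo["CityTier"] = demo["CityTier"].astype(tier_dtype)

print(demo["CityTier"].min())  # 3 -- the least developed tier present
```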

Customer interaction data:¶

  • PitchSatisfactionScore: Sales pitch satisfaction score
  • ProductPitched: Product pitched by the salesperson
  • NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
  • DurationOfPitch: Duration of the pitch by a salesperson to the customer

Importing required packages

In [1]:
import sys
import os
In [2]:
from IPython.display import display, HTML

display(HTML("<style>.container { width:95% !important; }</style>"))
display(HTML("<style>.output_result { max-width:95% !important; }</style>"))
In [3]:
# # this will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black
In [4]:
# Library to suppress warnings or deprecation notes
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# library to help imputing mode
from scipy.stats import mode
import scipy.stats as stats

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries to build the decision tree classifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# To tune different models
from sklearn.model_selection import GridSearchCV

# To build ensemble models for prediction
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)
In [5]:
# to display and store Matplotlib plots within a Python Jupyter notebook
%matplotlib inline

# enable retina display
%config InlineBackend.figure_format='retina'
In [6]:
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

np.set_printoptions(edgeitems=20, linewidth=100)
np.set_printoptions(suppress=True)
pd.set_option("expand_frame_repr", False)

sns.set_style(style="darkgrid")

Data Ingestion

Read the dataset¶

In [7]:
# read the Tourism.xlsx file
data_path = "/content/sample_data/Tourism.xlsx"
# data_file = "Tourism.xlsx"
data = pd.read_excel(data_path, sheet_name="Tourism")
data
Out[7]:
CustomerID ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
0 200000 1 41.0 Self Enquiry 3 6.0 Salaried Female 3 3.0 Deluxe 3.0 Single 1.0 1 2 1 0.0 Manager 20993.0
1 200001 0 49.0 Company Invited 1 14.0 Salaried Male 3 4.0 Deluxe 4.0 Divorced 2.0 0 3 1 2.0 Manager 20130.0
2 200002 1 37.0 Self Enquiry 1 8.0 Free Lancer Male 3 4.0 Basic 3.0 Single 7.0 1 3 0 0.0 Executive 17090.0
3 200003 0 33.0 Company Invited 1 9.0 Salaried Female 2 3.0 Basic 3.0 Divorced 2.0 1 5 1 1.0 Executive 17909.0
4 200004 0 NaN Self Enquiry 1 8.0 Small Business Male 2 3.0 Basic 4.0 Divorced 1.0 0 5 1 0.0 Executive 18468.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4883 204883 1 49.0 Self Enquiry 3 9.0 Small Business Male 3 5.0 Deluxe 4.0 Unmarried 2.0 1 1 1 1.0 Manager 26576.0
4884 204884 1 28.0 Company Invited 1 31.0 Salaried Male 4 5.0 Basic 3.0 Single 3.0 1 3 1 2.0 Executive 21212.0
4885 204885 1 52.0 Self Enquiry 3 17.0 Salaried Female 4 4.0 Standard 4.0 Married 7.0 0 1 1 3.0 Senior Manager 31820.0
4886 204886 1 19.0 Self Enquiry 3 16.0 Small Business Male 3 4.0 Basic 3.0 Single 3.0 0 5 0 2.0 Executive 20289.0
4887 204887 1 36.0 Self Enquiry 1 14.0 Salaried Male 4 4.0 Basic 4.0 Unmarried 3.0 1 3 1 2.0 Executive 24041.0

4888 rows × 20 columns

In [8]:
# copying data to another variable to avoid any changes to the original data
df = data.copy()  # dataframe for `travel pack` data

View the first and last 5 rows of the dataset.¶

In [9]:
df.head()
Out[9]:
CustomerID ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
0 200000 1 41.0 Self Enquiry 3 6.0 Salaried Female 3 3.0 Deluxe 3.0 Single 1.0 1 2 1 0.0 Manager 20993.0
1 200001 0 49.0 Company Invited 1 14.0 Salaried Male 3 4.0 Deluxe 4.0 Divorced 2.0 0 3 1 2.0 Manager 20130.0
2 200002 1 37.0 Self Enquiry 1 8.0 Free Lancer Male 3 4.0 Basic 3.0 Single 7.0 1 3 0 0.0 Executive 17090.0
3 200003 0 33.0 Company Invited 1 9.0 Salaried Female 2 3.0 Basic 3.0 Divorced 2.0 1 5 1 1.0 Executive 17909.0
4 200004 0 NaN Self Enquiry 1 8.0 Small Business Male 2 3.0 Basic 4.0 Divorced 1.0 0 5 1 0.0 Executive 18468.0
In [10]:
df.tail()
Out[10]:
CustomerID ProdTaken Age TypeofContact CityTier DurationOfPitch Occupation Gender NumberOfPersonVisiting NumberOfFollowups ProductPitched PreferredPropertyStar MaritalStatus NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting Designation MonthlyIncome
4883 204883 1 49.0 Self Enquiry 3 9.0 Small Business Male 3 5.0 Deluxe 4.0 Unmarried 2.0 1 1 1 1.0 Manager 26576.0
4884 204884 1 28.0 Company Invited 1 31.0 Salaried Male 4 5.0 Basic 3.0 Single 3.0 1 3 1 2.0 Executive 21212.0
4885 204885 1 52.0 Self Enquiry 3 17.0 Salaried Female 4 4.0 Standard 4.0 Married 7.0 0 1 1 3.0 Senior Manager 31820.0
4886 204886 1 19.0 Self Enquiry 3 16.0 Small Business Male 3 4.0 Basic 3.0 Single 3.0 0 5 0 2.0 Executive 20289.0
4887 204887 1 36.0 Self Enquiry 1 14.0 Salaried Male 4 4.0 Basic 4.0 Unmarried 3.0 1 3 1 2.0 Executive 24041.0

Understand the shape of the dataset.¶

In [11]:
df.shape
Out[11]:
(4888, 20)
  • The original dataset has 4888 rows and 20 columns of data

Check the data types of the columns for the dataset.¶

In [12]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4888 entries, 0 to 4887
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CustomerID                4888 non-null   int64  
 1   ProdTaken                 4888 non-null   int64  
 2   Age                       4662 non-null   float64
 3   TypeofContact             4863 non-null   object 
 4   CityTier                  4888 non-null   int64  
 5   DurationOfPitch           4637 non-null   float64
 6   Occupation                4888 non-null   object 
 7   Gender                    4888 non-null   object 
 8   NumberOfPersonVisiting    4888 non-null   int64  
 9   NumberOfFollowups         4843 non-null   float64
 10  ProductPitched            4888 non-null   object 
 11  PreferredPropertyStar     4862 non-null   float64
 12  MaritalStatus             4888 non-null   object 
 13  NumberOfTrips             4748 non-null   float64
 14  Passport                  4888 non-null   int64  
 15  PitchSatisfactionScore    4888 non-null   int64  
 16  OwnCar                    4888 non-null   int64  
 17  NumberOfChildrenVisiting  4822 non-null   float64
 18  Designation               4888 non-null   object 
 19  MonthlyIncome             4655 non-null   float64
dtypes: float64(7), int64(7), object(6)
memory usage: 763.9+ KB

Check the data of the "object" datatype columns.¶

In [13]:
cols_obj = df.select_dtypes(["object"])
cols_obj.columns
Out[13]:
Index(['TypeofContact', 'Occupation', 'Gender', 'ProductPitched',
       'MaritalStatus', 'Designation'],
      dtype='object')
In [14]:
# Checking value counts of categorical variables
for i in cols_obj:
    print(f'Unique values in "{i}" are :')
    df_concat = pd.concat(
        [
            df[i].value_counts().to_frame(),
            round(
                df[i].value_counts(normalize=True).to_frame().rename(columns={i: "%"})
                * 100,
                2,
            ),
        ],
        axis=1,
    )
    print(df_concat)
    print("*" * 50)
Unique values in "TypeofContact" are :
                 TypeofContact      %
Self Enquiry              3444  70.82
Company Invited           1419  29.18
**************************************************
Unique values in "Occupation" are :
                Occupation      %
Salaried              2368  48.45
Small Business        2084  42.64
Large Business         434   8.88
Free Lancer              2   0.04
**************************************************
Unique values in "Gender" are :
         Gender      %
Male       2916  59.66
Female     1817  37.17
Fe Male     155   3.17
**************************************************
Unique values in "ProductPitched" are :
              ProductPitched      %
Basic                   1842  37.68
Deluxe                  1732  35.43
Standard                 742  15.18
Super Deluxe             342   7.00
King                     230   4.71
**************************************************
Unique values in "MaritalStatus" are :
           MaritalStatus      %
Married             2340  47.87
Divorced             950  19.44
Single               916  18.74
Unmarried            682  13.95
**************************************************
Unique values in "Designation" are :
                Designation      %
Executive              1842  37.68
Manager                1732  35.43
Senior Manager          742  15.18
AVP                     342   7.00
VP                      230   4.71
**************************************************

Observations

  • For TypeofContact, the company contacted the customer ("Company Invited") only about 29% of the time; in the remaining ~71% of cases the customer contacted the company themselves ("Self Enquiry").
  • The most frequent Occupation values are 'Salaried' and 'Small Business', together about 91%, with a small share for 'Large Business' and only two 'Free Lancer' records.
  • Gender has a third, mistyped value "Fe Male". We will fix this below.
  • The most frequently pitched products under ProductPitched are 'Basic' and 'Deluxe', about 73% together. The rest is distributed among 'Standard', 'Super Deluxe', and 'King'.
  • For MaritalStatus, 'Married' is the largest group at about 48%; 'Divorced', 'Single', and 'Unmarried' together account for the remaining 52%.
  • For the customers' levels in their own organizations, 'Executive' and 'Manager' together account for about 73% of Designation values.
In [15]:
# Removing `CustomerID` variable from the dataset
df.drop(axis=1, columns=["CustomerID"], inplace=True)
In [16]:
# fixing 'Fe Male' typo on 'Gender'
df["Gender"] = df["Gender"].apply(lambda x: "Female" if x == "Fe Male" else x)

Summary of the dataset (numerical)¶

In [17]:
# Describing only numerical variables
df.describe(include=[np.int64, np.float64]).T
Out[17]:
count mean std min 25% 50% 75% max
ProdTaken 4888.0 0.188216 0.390925 0.0 0.0 0.0 0.0 1.0
Age 4662.0 37.622265 9.316387 18.0 31.0 36.0 44.0 61.0
CityTier 4888.0 1.654255 0.916583 1.0 1.0 1.0 3.0 3.0
DurationOfPitch 4637.0 15.490835 8.519643 5.0 9.0 13.0 20.0 127.0
NumberOfPersonVisiting 4888.0 2.905074 0.724891 1.0 2.0 3.0 3.0 5.0
NumberOfFollowups 4843.0 3.708445 1.002509 1.0 3.0 4.0 4.0 6.0
PreferredPropertyStar 4862.0 3.581037 0.798009 3.0 3.0 3.0 4.0 5.0
NumberOfTrips 4748.0 3.236521 1.849019 1.0 2.0 3.0 4.0 22.0
Passport 4888.0 0.290917 0.454232 0.0 0.0 0.0 1.0 1.0
PitchSatisfactionScore 4888.0 3.078151 1.365792 1.0 2.0 3.0 4.0 5.0
OwnCar 4888.0 0.620295 0.485363 0.0 0.0 1.0 1.0 1.0
NumberOfChildrenVisiting 4822.0 1.187267 0.857861 0.0 1.0 1.0 2.0 3.0
MonthlyIncome 4655.0 23619.853491 5380.698361 1000.0 20346.0 22347.0 25571.0 98678.0

Observations

  • "ProdTaken" is the dependent variable; it is stored as an integer and will be converted to 'categorical'.
  • CityTier, NumberOfPersonVisiting, NumberOfFollowups, PreferredPropertyStar, Passport, PitchSatisfactionScore, OwnCar, and NumberOfChildrenVisiting take a small set of discrete values and can also be converted to 'categorical'.
  • The independent variables TypeofContact, Occupation, Gender, ProductPitched, MaritalStatus, and Designation are the other 'categorical' variables.
  • NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, and NumberOfChildrenVisiting are "float64" but represent whole numbers; we will convert them to a nullable integer type ("Int64") so their missing values are preserved.
In [18]:
# list of float64 datatype to int64
features_float64 = [
    "NumberOfFollowups",
    "PreferredPropertyStar",
    "NumberOfTrips",
    "NumberOfChildrenVisiting",
]
# converting to nullable Int64 and printing the number of missing values
for feature in features_float64:
    df[feature] = df[feature].astype(pd.Int64Dtype())
    print(
        f"feature: ['{feature}'] has {df[feature].isnull().sum()} missing (NaN) values."
    )
feature: ['NumberOfFollowups'] has 45 missing (NaN) values.
feature: ['PreferredPropertyStar'] has 26 missing (NaN) values.
feature: ['NumberOfTrips'] has 140 missing (NaN) values.
feature: ['NumberOfChildrenVisiting'] has 66 missing (NaN) values.
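The nullable Int64 dtype used above matters because a plain numpy int64 column cannot represent NaN, so a float column with gaps would fail a direct cast. A small sketch on toy data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan])

# s.astype("int64") would raise, since plain int64 cannot hold NaN;
# pandas' nullable Int64 stores the gap as pd.NA instead
s_int = s.astype(pd.Int64Dtype())

print(s_int.dtype)         # Int64
print(s_int.isna().sum())  # 1
```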

Creating categorical variables¶

In [19]:
# Set of dependent/independent variables to be converted to categorical
col_cat = [
    "ProdTaken",
    "CityTier",
    "NumberOfPersonVisiting",
    "NumberOfFollowups",
    "PreferredPropertyStar",
    "Passport",
    "PitchSatisfactionScore",
    "OwnCar",
    "NumberOfChildrenVisiting",
    "TypeofContact",
    "Occupation",
    "Gender",
    "ProductPitched",
    "MaritalStatus",
    "Designation",
]

# loop to convert indep. variables with discrete values to `categorical`
for col in col_cat:
    df[col] = df[col].astype("category")

Revisiting categorical variables¶

In [20]:
# Checking value counts of categorical variables
for i in col_cat:
    print(f'Unique values in "{i}" are :')
    df_concat = pd.concat(
        [
            df[i].value_counts().to_frame(),
            round(
                df[i].value_counts(normalize=True).to_frame().rename(columns={i: "%"})
                * 100,
                2,
            ),
        ],
        axis=1,
    )
    print(df_concat)
    print("*" * 50)
Unique values in "ProdTaken" are :
   ProdTaken      %
0       3968  81.18
1        920  18.82
**************************************************
Unique values in "CityTier" are :
   CityTier      %
1      3190  65.26
3      1500  30.69
2       198   4.05
**************************************************
Unique values in "NumberOfPersonVisiting" are :
   NumberOfPersonVisiting      %
3                    2402  49.14
2                    1418  29.01
4                    1026  20.99
1                      39   0.80
5                       3   0.06
**************************************************
Unique values in "NumberOfFollowups" are :
   NumberOfFollowups      %
4               2068  42.70
3               1466  30.27
5                768  15.86
2                229   4.73
1                176   3.63
6                136   2.81
**************************************************
Unique values in "PreferredPropertyStar" are :
   PreferredPropertyStar      %
3                   2993  61.56
5                    956  19.66
4                    913  18.78
**************************************************
Unique values in "Passport" are :
   Passport      %
0      3466  70.91
1      1422  29.09
**************************************************
Unique values in "PitchSatisfactionScore" are :
   PitchSatisfactionScore      %
3                    1478  30.24
5                     970  19.84
1                     942  19.27
4                     912  18.66
2                     586  11.99
**************************************************
Unique values in "OwnCar" are :
   OwnCar      %
1    3032  62.03
0    1856  37.97
**************************************************
Unique values in "NumberOfChildrenVisiting" are :
   NumberOfChildrenVisiting      %
1                      2080  43.14
2                      1335  27.69
0                      1082  22.44
3                       325   6.74
**************************************************
Unique values in "TypeofContact" are :
                 TypeofContact      %
Self Enquiry              3444  70.82
Company Invited           1419  29.18
**************************************************
Unique values in "Occupation" are :
                Occupation      %
Salaried              2368  48.45
Small Business        2084  42.64
Large Business         434   8.88
Free Lancer              2   0.04
**************************************************
Unique values in "Gender" are :
        Gender      %
Male      2916  59.66
Female    1972  40.34
**************************************************
Unique values in "ProductPitched" are :
              ProductPitched      %
Basic                   1842  37.68
Deluxe                  1732  35.43
Standard                 742  15.18
Super Deluxe             342   7.00
King                     230   4.71
**************************************************
Unique values in "MaritalStatus" are :
           MaritalStatus      %
Married             2340  47.87
Divorced             950  19.44
Single               916  18.74
Unmarried            682  13.95
**************************************************
Unique values in "Designation" are :
                Designation      %
Executive              1842  37.68
Manager                1732  35.43
Senior Manager          742  15.18
AVP                     342   7.00
VP                      230   4.71
**************************************************

Data Imputation on categorical variables¶

Which categorical variables have 'missing' values?¶

In [21]:
cat_vars = [
    "TypeofContact",
    "Occupation",
    "Gender",
    "ProductPitched",
    "MaritalStatus",
    "Designation",
]

for feature in cat_vars:
    print(
        f"feature: ['{feature}'] has {df[feature].isnull().sum()} missing (NaN) values."
    )
feature: ['TypeofContact'] has 25 missing (NaN) values.
feature: ['Occupation'] has 0 missing (NaN) values.
feature: ['Gender'] has 0 missing (NaN) values.
feature: ['ProductPitched'] has 0 missing (NaN) values.
feature: ['MaritalStatus'] has 0 missing (NaN) values.
feature: ['Designation'] has 0 missing (NaN) values.
  • Only "TypeofContact" has 25 missing (NaN) values.

Imputing data for "TypeofContact" with missing values¶

In [22]:
# current possible values for "TypeofContact"
df["TypeofContact"].value_counts()
Out[22]:
Self Enquiry       3444
Company Invited    1419
Name: TypeofContact, dtype: int64
  • As we saw above, the ratio of "Self Enquiry" to "Company Invited" is approximately 70:30.
  • We will impute the 'mode' of "TypeofContact" for the missing data in the column.
In [23]:
# calculating the 'mode' for the feature
mode_TypeofContact = mode(df["TypeofContact"])[0][0]
print(f"The 'mode' for the feature \"TypeofContact\" is '{mode_TypeofContact}'")
df.loc[df["TypeofContact"].isnull(), "TypeofContact"] = mode_TypeofContact
The 'mode' for the feature "TypeofContact" is 'Self Enquiry'
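The same imputation can also be written with pandas alone, via Series.mode() and fillna(). A minimal equivalent sketch on a toy series (values are illustrative, not the real column):

```python
import pandas as pd

# Toy stand-in for df["TypeofContact"]
s = pd.Series(["Self Enquiry", "Company Invited", "Self Enquiry", None])

# Series.mode() returns the most frequent value(s); take the first
s_filled = s.fillna(s.mode()[0])

print(s_filled.isna().sum())  # 0
```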

Feature Engineering¶

Encoding Categorical Variables¶

  • The following categorical variables will be encoded:
    • TypeofContact
    • Occupation
    • Gender
    • ProductPitched
    • MaritalStatus
    • Designation
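One caveat of the dictionary-based .map encoding used below: any category value absent from the mapping silently becomes NaN, so each dictionary must cover every level. A small sketch with a hypothetical unmapped value:

```python
import pandas as pd

s = pd.Series(["Basic", "Deluxe", "Luxury"])  # "Luxury" is hypothetical, not in the map
mapping = {"Basic": 0, "Deluxe": 1}

encoded = s.map(mapping)
print(encoded.isna().sum())  # 1 -- the unmapped category became NaN
```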
In [24]:
# list of columns to encode
cols_to_encode = [
    "TypeofContact",
    "Occupation",
    "Gender",
    "ProductPitched",
    "MaritalStatus",
    "Designation",
]

# maps for columns to encode
TypeofContact_dict = {"Self Enquiry": 0, "Company Invited": 1}
Occupation_dict = {
    "Salaried": 0,
    "Small Business": 1,
    "Large Business": 2,
    "Free Lancer": 3,
}
Gender_dict = {"Male": 0, "Female": 1}
ProductPitched_dict = {
    "Basic": 0,
    "Deluxe": 1,
    "Standard": 2,
    "Super Deluxe": 3,
    "King": 4,
}
MaritalStatus_dict = {"Married": 0, "Divorced": 1, "Single": 2, "Unmarried": 3}
Designation_dict = {
    "Executive": 0,
    "Manager": 1,
    "Senior Manager": 2,
    "AVP": 3,
    "VP": 4,
}

# index of dictionaries of columns to encode
enc_list_dicts = {
    0: TypeofContact_dict,
    1: Occupation_dict,
    2: Gender_dict,
    3: ProductPitched_dict,
    4: MaritalStatus_dict,
    5: Designation_dict,
}
In [25]:
# encoding columns to encode
for i, feature in enumerate(cols_to_encode):
    df[feature + "_num"] = df[feature].map(enc_list_dicts[i])
    print(feature, "\n", df[feature + "_num"].value_counts())
    print(80 * "*")
TypeofContact 
 0    3469
1    1419
Name: TypeofContact_num, dtype: int64
********************************************************************************
Occupation 
 0    2368
1    2084
2     434
3       2
Name: Occupation_num, dtype: int64
********************************************************************************
Gender 
 0    2916
1    1972
Name: Gender_num, dtype: int64
********************************************************************************
ProductPitched 
 0    1842
1    1732
2     742
3     342
4     230
Name: ProductPitched_num, dtype: int64
********************************************************************************
MaritalStatus 
 0    2340
1     950
2     916
3     682
Name: MaritalStatus_num, dtype: int64
********************************************************************************
Designation 
 0    1842
1    1732
2     742
3     342
4     230
Name: Designation_num, dtype: int64
********************************************************************************

Removing source of encoded columns (cols_to_encode)¶

In [26]:
df.drop(
    labels=[
        "TypeofContact",
        "Occupation",
        "Gender",
        "ProductPitched",
        "MaritalStatus",
        "Designation",
    ],
    axis=1,
    inplace=True,
)
In [27]:
df
Out[27]:
ProdTaken Age CityTier DurationOfPitch NumberOfPersonVisiting NumberOfFollowups PreferredPropertyStar NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting MonthlyIncome TypeofContact_num Occupation_num Gender_num ProductPitched_num MaritalStatus_num Designation_num
0 1 41.0 3 6.0 3 3 3 1 1 2 1 0 20993.0 0 0 1 1 2 1
1 0 49.0 1 14.0 3 4 4 2 0 3 1 2 20130.0 1 0 0 1 1 1
2 1 37.0 1 8.0 3 4 3 7 1 3 0 0 17090.0 0 3 0 0 2 0
3 0 33.0 1 9.0 2 3 3 2 1 5 1 1 17909.0 1 0 1 0 1 0
4 0 NaN 1 8.0 2 3 4 1 0 5 1 0 18468.0 0 1 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4883 1 49.0 3 9.0 3 5 4 2 1 1 1 1 26576.0 0 1 0 1 3 1
4884 1 28.0 1 31.0 4 5 3 3 1 3 1 2 21212.0 1 0 0 0 2 0
4885 1 52.0 3 17.0 4 4 4 7 0 1 1 3 31820.0 0 0 1 2 0 2
4886 1 19.0 3 16.0 3 4 3 3 0 5 0 2 20289.0 0 1 0 0 2 0
4887 1 36.0 1 14.0 4 4 4 3 1 3 1 2 24041.0 0 0 0 0 3 0

4888 rows × 19 columns

Check for missing data¶

In [28]:
df.isnull().sum()
Out[28]:
ProdTaken                     0
Age                         226
CityTier                      0
DurationOfPitch             251
NumberOfPersonVisiting        0
NumberOfFollowups            45
PreferredPropertyStar        26
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
MonthlyIncome               233
TypeofContact_num             0
Occupation_num                0
Gender_num                    0
ProductPitched_num            0
MaritalStatus_num             0
Designation_num               0
dtype: int64
  • There are still missing values in the dataset, in the columns: Age, DurationOfPitch, NumberOfFollowups, PreferredPropertyStar, NumberOfTrips, NumberOfChildrenVisiting, and MonthlyIncome.
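The table of contents plans to impute these remaining gaps with KNN in the pre-processing section. As a preview, a minimal sketch of scikit-learn's KNNImputer on toy data (not the notebook's actual imputation call):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix standing in for numeric columns with gaps
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0]])

# Each missing entry is filled with the mean of its 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[2, 0])  # 2.0 -- mean of the two neighboring rows
```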
In [29]:
# list of columns with missing values
miss_cols = [
    "Age",
    "DurationOfPitch",
    "NumberOfFollowups",
    "PreferredPropertyStar",
    "NumberOfTrips",
    "NumberOfChildrenVisiting",
    "MonthlyIncome",
]

# list of data types of columns with missing values
miss_cols_dtype = []
for col in miss_cols:
    miss_cols_dtype.append(f"{df[col].dtype}")

# dictionary containing columns with missing values and their data types
miss_cols_dict = dict(zip(miss_cols, miss_cols_dtype))
miss_cols_dict
Out[29]:
{'Age': 'float64',
 'DurationOfPitch': 'float64',
 'MonthlyIncome': 'float64',
 'NumberOfChildrenVisiting': 'category',
 'NumberOfFollowups': 'category',
 'NumberOfTrips': 'Int64',
 'PreferredPropertyStar': 'category'}
In [30]:
for k, v in miss_cols_dict.items():
    print(k, df[df[k].isnull()].shape[0])
Age 226
DurationOfPitch 251
NumberOfFollowups 45
PreferredPropertyStar 26
NumberOfTrips 140
NumberOfChildrenVisiting 66
MonthlyIncome 233

Exploratory Data Analysis (EDA)

Univariate Analysis¶

In [31]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.2f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot


def histogram_boxplot(data, feature, figsize=(16, 8), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (16, 8))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots

    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column

    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram

    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram

    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [32]:
# function to plot univariate analysis based on feature data type
def plot_univariate(data, feature):
    """
    Plot univariate based on feature data type

    data: dataframe
    feature: dataframe column
    """
    print(data[feature].dtype)
    if data[feature].dtype in ("category", "Int64", "object"):
        labeled_barplot(data, feature, perc=True)
    else:
        histogram_boxplot(data, feature, kde=True)
    return


# function to find outliers values from the feature
def get_outliers(data, feature, factor=1.5, include_indexes=False):
    """
    function to find outliers

    data: dataframe
    feature: dataframe column
    """
    # describe() returns [count, mean, std, min, 25%, 50%, 75%, max]; take the quartiles
    p25, p50, p75 = data[feature].describe().to_numpy()[-4:-1].tolist()
    iqr = p75 - p25
    loww = p25 - iqr * factor
    uppw = p75 + iqr * factor
    filt = (data[feature] > uppw) | (data[feature] < loww)
    if include_indexes:
        return data.loc[filt, feature].tolist(), data.loc[filt, feature].index
    return data.loc[filt, feature].tolist(), []


# function to extract some useful statistics from the feature distribution
def get_stats(data, feature):
    """
    Get mean, stdev, median, variance, and mode of a 'feature'
    data: dataframe
    feature: dataframe column
    """
    avg = np.nanmean(data[feature])
    stdev = np.nanstd(data[feature])
    median = np.nanmedian(data[feature])
    var = np.nanvar(data[feature])
    values, counts = np.unique(data[feature], return_counts=True)
    mode = values[np.argmax(counts)]
    return avg, stdev, median, var, mode
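
As a quick sanity check of the IQR fence logic inside `get_outliers`, the same rule can be reproduced directly with `pandas` `quantile` on a toy series (hypothetical values, not from the dataset):

```python
import pandas as pd

# toy series standing in for a dataframe column (hypothetical values)
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])

p25, p75 = s.quantile([0.25, 0.75])  # quartiles, as in get_outliers
iqr = p75 - p25                      # inter-quartile range
low, up = p25 - 1.5 * iqr, p75 + 1.5 * iqr  # fences with the default factor 1.5
outliers = s[(s < low) | (s > up)].tolist()
# with these values: iqr = 5.0, fences (-4.0, 16.0), outliers = [100]
```

Raising the factor (e.g. 2.5, as done later for 'DurationOfPitch' and 'MonthlyIncome') widens the fences so only the most extreme values are flagged.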

Observations on 'ProdTaken'¶

In [33]:
plot_univariate(df, "ProdTaken")
category
  • A little over 81% of the customers have not applied for a package yet.

Observations on 'Age'¶

In [34]:
labeled_barplot(df, "Age", perc=True)
In [35]:
plot_univariate(df, "Age")
float64
  • On 'Age' we have a right-skewed distribution with no visible outliers and a small hump to the left of the median, a signal of a possible bi-modal distribution.

Observations on 'CityTier'¶

In [36]:
plot_univariate(df, "CityTier")
category
  • On 'CityTier' the most frequent tier is 1 with 65%, followed by tier 3 at around 31%, and a small 4% at tier 2.

Observations on 'DurationOfPitch'¶

In [37]:
labeled_barplot(df, "DurationOfPitch", perc=True)
In [38]:
plot_univariate(df, "DurationOfPitch")
float64
  • On 'DurationOfPitch' we have a right-skewed distribution (the outliers make the tail very long). We will remove these outlier observations as they may be exaggerated values for a sales pitch.

Treating outliers of 'DurationOfPitch'¶

In [39]:
# getting outliers values and location indexes of the outliers
outliers, bad_indexes = get_outliers(
    df, "DurationOfPitch", factor=2.5, include_indexes=True
)
outliers, bad_indexes
Out[39]:
([126.0, 127.0], Int64Index([1434, 3878], dtype='int64'))
  • There are two observations with 'DurationOfPitch' higher than two hours. We will remove them.
In [40]:
# display rows with 'bad_indexes'
df.loc[bad_indexes]
Out[40]:
ProdTaken Age CityTier DurationOfPitch NumberOfPersonVisiting NumberOfFollowups PreferredPropertyStar NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting MonthlyIncome TypeofContact_num Occupation_num Gender_num ProductPitched_num MaritalStatus_num Designation_num
1434 0 NaN 3 126.0 2 3 3 3 0 1 1 1 18482.0 1 0 0 0 0 0
3878 0 53.0 3 127.0 3 4 3 4 0 1 1 2 22160.0 1 0 0 0 0 0
In [41]:
# removing highly extreme outliers
df.drop(axis=0, index=bad_indexes, inplace=True)
In [42]:
# get some stats from the "DurationOfPitch" distribution without the presence of "outliers"
dfdata = df.loc[~df.index.isin(bad_indexes)].copy()
avg, std, median, var, mode = get_stats(dfdata, "DurationOfPitch")
avg, std, median, mode
Out[42]:
(15.442934196332255, 8.20244992074016, 13.0, 9.0)
  • We have removed observations containing highly extreme values.
  • For the missing data in the column 'DurationOfPitch' we will utilize the mode of 9 minutes.
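
A minimal sketch of what mode imputation for this column would look like (hypothetical values; the actual imputation performed later in this notebook uses KNN for all missing columns):

```python
import pandas as pd

# toy column with a gap (hypothetical values)
pitch = pd.Series([9.0, 13.0, None, 9.0, 20.0])

mode = pitch.mode().iloc[0]        # most frequent value -> 9.0
pitch_filled = pitch.fillna(mode)  # no missing values remain
```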

Revisiting curated feature 'DurationOfPitch'¶

In [43]:
labeled_barplot(df, "DurationOfPitch", perc=True)
In [44]:
plot_univariate(df, "DurationOfPitch")
float64
  • Distribution for 'DurationOfPitch' is right-skewed with a second hump around the mean value, indicating a bi-modal distribution.

Observations on 'NumberOfPersonVisiting'¶

In [45]:
plot_univariate(df, "NumberOfPersonVisiting")
category
  • On 'NumberOfPersonVisiting' the most frequent group size is 3 people (49%), followed by 2 people (29%) and 4 people (21%).

Observations on 'NumberOfFollowups'¶

In [46]:
plot_univariate(df, "NumberOfFollowups")
category
  • On 'NumberOfFollowups' the most frequent number of follow-ups is 4 (~42%), followed by 3 (30%) and 5 (~16%). The remaining values occur rarely: 1 at 4%, 2 at 5%, and 6 at 3%.

Observations on 'PreferredPropertyStar'¶

In [47]:
plot_univariate(df, "PreferredPropertyStar")
category
  • On 'PreferredPropertyStar' the most frequent category is 3 stars (61%), followed by 5 stars (~20%) and 4 stars (19%).

Observations on 'NumberOfTrips'¶

In [48]:
plot_univariate(df, "NumberOfTrips")
Int64
  • On 'NumberOfTrips' the most frequent value is 2 trips (~30%), followed by 3 trips (22%), 1 trip (~13%), and 4, 5, 6, 7, and 8 trips at 10%, 10%, 7%, 5%, and 2% respectively.
  • However, we observe 'NumberOfTrips' values of 19, 20, 21, and 22, each at only 0.02%. We will remove these observations as they are likely outliers with an extremely low presence.
In [49]:
indexes_todrop = df[df["NumberOfTrips"].isin([19, 20, 21, 22])].index
df.drop(axis=0, index=indexes_todrop, inplace=True)

Observations on 'Passport'¶

In [50]:
plot_univariate(df, "Passport")
category
  • On 'Passport', almost 71% of customers don't have one; only 29% have a 'Passport'.

Observations on 'PitchSatisfactionScore'¶

In [51]:
plot_univariate(df, "PitchSatisfactionScore")
category
  • On 'PitchSatisfactionScore' the most frequent rating is 3, at 30%. Ratings 1, 4, and 5 occur at 19%, 19%, and 20% respectively.

  • The least frequent rating given by customers is 2, at 12%.

Observations on 'OwnCar'¶

In [52]:
plot_univariate(df, "OwnCar")
category
  • For the feature 'OwnCar', 62% own a car and 38% don't.

Observations on 'NumberOfChildrenVisiting'¶

In [53]:
plot_univariate(df, "NumberOfChildrenVisiting")
category
  • For the feature 'NumberOfChildrenVisiting', the most frequent number of children accompanying visitors is 1 (~43%), followed by 2 (27%) and 0 (22%); the least frequent is 3 (~7%).

Observations on 'MonthlyIncome'¶

In [54]:
plot_univariate(df, "MonthlyIncome")
float64
  • The feature 'MonthlyIncome' shows an interesting right-skewed distribution with several humps side-by-side.
  • The box plot shows a few significant outliers around the median and a couple of 'extreme' values far away from it.
  • We will do an outlier analysis on this feature.

Treatment of Outliers on 'MonthlyIncome'¶

In [55]:
# getting outliers values and location indexes of the outliers
# Note: using a larger IQR multiplier (factor=2.5) instead of the common 1.5
outliers, bad_indexes = get_outliers(
    df, "MonthlyIncome", factor=2.5, include_indexes=True
)
# extreme outliers and indexes of their location
outliers, bad_indexes
Out[55]:
([95000.0, 1000.0, 98678.0, 4678.0],
 Int64Index([38, 142, 2482, 2586], dtype='int64'))
In [56]:
# observing the rows with the `bad_indexes` considered as `extreme outliers`
df.loc[
    bad_indexes,
    [
        "Passport",
        "MonthlyIncome",
        "Designation_num",
        "PreferredPropertyStar",
        "NumberOfPersonVisiting",
    ],
]
Out[56]:
Passport MonthlyIncome Designation_num PreferredPropertyStar NumberOfPersonVisiting
38 1 95000.0 0 NaN 2
142 1 1000.0 1 3 2
2482 1 98678.0 0 5 3
2586 1 4678.0 1 3 3
  • The 'MonthlyIncome' of the extreme outliers corresponds to Executive (Designation_num = 0) and Manager (Designation_num = 1).

  • Those values don't seem to be plausible figures for 'MonthlyIncome', so we will remove these observations.

In [57]:
# drop extreme outliers out of the 'MonthlyIncome' column.
df.drop(index=bad_indexes, inplace=True)

Converting 'MonthlyIncome' to thousands¶

In [58]:
df["MonthlyIncome"] = df["MonthlyIncome"] / 1000.00

Revisiting Observation on 'MonthlyIncome'¶

In [59]:
plot_univariate(df, "MonthlyIncome")
float64
  • Now the feature 'MonthlyIncome' shows a right-skewed distribution with a few humps side-by-side.

  • The box plot still shows some outliers; they are kept in this case as they may enrich the dataset insights.

Observations on 'TypeofContact_num'¶

In [60]:
plot_univariate(df, "TypeofContact_num")
category
In [61]:
{v: k for k, v in TypeofContact_dict.items()}
Out[61]:
{0: 'Self Enquiry', 1: 'Company Invited'}
  • reversed dictionary: {0: 'Self Enquiry', 1: 'Company Invited'}

  • On the feature 'TypeofContact_num', the class '0' has a presence of 71% while class '1' has 29%.

Observations on 'Occupation_num'¶

In [62]:
plot_univariate(df, "Occupation_num")
category
In [63]:
{v: k for k, v in Occupation_dict.items()}
Out[63]:
{0: 'Salaried', 1: 'Small Business', 2: 'Large Business', 3: 'Free Lancer'}
  • reversed dictionary: {0: 'Salaried', 1: 'Small Business', 2: 'Large Business', 3: 'Free Lancer'}
  • On the feature 'Occupation_num', the class = '0' has a presence of 48% corresponding to 'Salaried' while class = '1' has almost 43% corresponding to 'Small Business'.
  • The class = '2'-'Large Business' has almost 9% frequency and the class = '3'-'Free Lancer' will be removed as it has only 0.04% presence and no relevance for this analysis.

Observations on 'Gender_num'¶

In [64]:
plot_univariate(df, "Gender_num")
category
In [65]:
{v: k for k, v in Gender_dict.items()}
Out[65]:
{0: 'Male', 1: 'Female'}
  • reversed dictionary: {0: 'Male', 1: 'Female'}

  • On the feature 'Gender_num', the class = '0'-Male has a presence of 60%, while class = '1'-Female has 40%.

Observations on 'ProductPitched_num'¶

In [66]:
plot_univariate(df, "ProductPitched_num")
category
In [67]:
{v: k for k, v in ProductPitched_dict.items()}
Out[67]:
{0: 'Basic', 1: 'Deluxe', 2: 'Standard', 3: 'Super Deluxe', 4: 'King'}
  • reversed dictionary: {0: 'Basic', 1: 'Deluxe', 2: 'Standard', 3: 'Super Deluxe', 4: 'King'}

  • On the feature 'ProductPitched_num', the class = '0'-Basic has the most presence of 38%, followed by class = '1'-Deluxe with 36%.

  • Class '2'-Standard has 15%, class '3'-Super Deluxe 7%, and class '4'-King 5%.

Observations on 'MaritalStatus_num'¶

In [68]:
plot_univariate(df, "MaritalStatus_num")
category
In [69]:
{v: k for k, v in MaritalStatus_dict.items()}
Out[69]:
{0: 'Married', 1: 'Divorced', 2: 'Single', 3: 'Unmarried'}
  • reversed dictionary: {0: 'Married', 1: 'Divorced', 2: 'Single', 3: 'Unmarried'}

  • Married represents 48%, followed by Divorced and Single at 19% each, and lastly Unmarried with 14%.

Observations on 'Designation_num'¶

In [70]:
plot_univariate(df, "Designation_num")
category
In [71]:
{v: k for k, v in Designation_dict.items()}
Out[71]:
{0: 'Executive', 1: 'Manager', 2: 'Senior Manager', 3: 'AVP', 4: 'VP'}
  • reversed dictionary: {0: 'Executive', 1: 'Manager', 2: 'Senior Manager', 3: 'AVP', 4: 'VP'}

  • Executive and Manager represent 38% and 35% respectively of the customers requesting packages.

Data Imputation with KNN

Review of the dataset status¶

In [72]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4878 entries, 0 to 4887
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   ProdTaken                 4878 non-null   category
 1   Age                       4653 non-null   float64 
 2   CityTier                  4878 non-null   category
 3   DurationOfPitch           4627 non-null   float64 
 4   NumberOfPersonVisiting    4878 non-null   category
 5   NumberOfFollowups         4833 non-null   category
 6   PreferredPropertyStar     4853 non-null   category
 7   NumberOfTrips             4738 non-null   Int64   
 8   Passport                  4878 non-null   category
 9   PitchSatisfactionScore    4878 non-null   category
 10  OwnCar                    4878 non-null   category
 11  NumberOfChildrenVisiting  4812 non-null   category
 12  MonthlyIncome             4645 non-null   float64 
 13  TypeofContact_num         4878 non-null   category
 14  Occupation_num            4878 non-null   category
 15  Gender_num                4878 non-null   category
 16  ProductPitched_num        4878 non-null   category
 17  MaritalStatus_num         4878 non-null   category
 18  Designation_num           4878 non-null   category
dtypes: Int64(1), category(15), float64(3)
memory usage: 398.3 KB

At this point of the Analysis we still have a few columns with missing data.¶

  • 'Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'NumberOfChildrenVisiting', and 'MonthlyIncome' are the columns with missing data.

  • We will impute data in those columns using the well-known K-nearest-neighbours (KNN) algorithm.

In [73]:
from sklearn.impute import KNNImputer
In [74]:
# making a safe copy before proceeding with the knn algorithm
df_safe = df.copy()
In [75]:
# initialize knn imputer
imputer = KNNImputer(n_neighbors=10)
In [76]:
# columns of the df_safe dataframe
columns = df_safe.columns
In [77]:
df_filled = imputer.fit_transform(df_safe)
In [78]:
# df_filled is currently a NumPy array of float64 values
# we will rebuild a DataFrame and re-cast the columns to their original datatypes
df_filled = pd.DataFrame(df_filled, columns=columns)
In [79]:
# converting to original data types
for col in df_filled.columns:
    if df[col].dtype == "category":
        df_filled[col] = (
            pd.to_numeric(df_filled[col]).astype(np.int64).astype("category")
        )
    if df[col].dtype == "float64":
        df_filled[col] = df_filled[col].astype("float64")
    if df[col].dtype == "Int64":
        df_filled[col] = df_filled[col].fillna(0).astype(np.int64, errors="ignore")
In [80]:
# converting to original data types
for col in df_filled.columns:
    if df[col].dtype == "category":
        df_filled[col] = pd.to_numeric(df_filled[col]).astype(np.int64)
    if df[col].dtype == "float64":
        df_filled[col] = df_filled[col].astype("float64")
In [81]:
# converting 'Age' and 'DurationOfPitch' to int64
df_filled["Age"] = df_filled["Age"].astype(np.int64)
df_filled["DurationOfPitch"] = df_filled["DurationOfPitch"].astype(np.int64)

Showing 5 rows on the head and 5 rows on the tail¶

In [82]:
df_filled.head()
Out[82]:
ProdTaken Age CityTier DurationOfPitch NumberOfPersonVisiting NumberOfFollowups PreferredPropertyStar NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting MonthlyIncome TypeofContact_num Occupation_num Gender_num ProductPitched_num MaritalStatus_num Designation_num
0 1 41 3 6 3 3 3 1 1 2 1 0 20.993 0 0 1 1 2 1
1 0 49 1 14 3 4 4 2 0 3 1 2 20.130 1 0 0 1 1 1
2 1 37 1 8 3 4 3 7 1 3 0 0 17.090 0 3 0 0 2 0
3 0 33 1 9 2 3 3 2 1 5 1 1 17.909 1 0 1 0 1 0
4 0 29 1 8 2 3 4 1 0 5 1 0 18.468 0 1 0 0 1 0
In [83]:
df_filled.tail()
Out[83]:
ProdTaken Age CityTier DurationOfPitch NumberOfPersonVisiting NumberOfFollowups PreferredPropertyStar NumberOfTrips Passport PitchSatisfactionScore OwnCar NumberOfChildrenVisiting MonthlyIncome TypeofContact_num Occupation_num Gender_num ProductPitched_num MaritalStatus_num Designation_num
4873 1 49 3 9 3 5 4 2 1 1 1 1 26.576 0 1 0 1 3 1
4874 1 28 1 31 4 5 3 3 1 3 1 2 21.212 1 0 0 0 2 0
4875 1 52 3 17 4 4 4 7 0 1 1 3 31.820 0 0 1 2 0 2
4876 1 19 3 16 3 4 3 3 0 5 0 2 20.289 0 1 0 0 2 0
4877 1 36 1 14 4 4 4 3 1 3 1 2 24.041 0 0 0 0 3 0
In [84]:
# copying imputed dataframe back to the original name
df = df_filled.copy()
In [85]:
df.dtypes
Out[85]:
ProdTaken                     int64
Age                           int64
CityTier                      int64
DurationOfPitch               int64
NumberOfPersonVisiting        int64
NumberOfFollowups             int64
PreferredPropertyStar         int64
NumberOfTrips                 int64
Passport                      int64
PitchSatisfactionScore        int64
OwnCar                        int64
NumberOfChildrenVisiting      int64
MonthlyIncome               float64
TypeofContact_num             int64
Occupation_num                int64
Gender_num                    int64
ProductPitched_num            int64
MaritalStatus_num             int64
Designation_num               int64
dtype: object

Bivariate Analysis

In [86]:
# function to plot stacked bar chart


def stacked_barplot(data, predictor, target):
    """
    Plot a stacked bar chart of the target distribution per predictor level

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 6))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

Correlation Matrix

In [87]:
plt.figure(figsize=(16, 10))
sns.heatmap(
    df[df.columns[1:]].corr(),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)
plt.show()
  • We have to make some decisions before continuing the EDA process with the bi-variate and multi-variate analysis.

  • The correlation plot above shows a couple of interesting relationships among pairs of features.

  • We will remove features that show a high correlation with another feature, keeping only one of each pair.

  • Starting with 'NumberOfChildrenVisiting', which has a +60% correlation with 'NumberOfPersonVisiting'; we will eliminate the former.

  • The feature 'MonthlyIncome' is positively correlated (~86%) with 'ProductPitched_num' and 'Designation_num'. We will eliminate the latter two.
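
The pairs flagged above can also be found programmatically by scanning the upper triangle of the correlation matrix for values above a threshold — a minimal sketch on a synthetic frame (the column names and threshold here are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# synthetic frame: "a" and "b" are built to be highly correlated (illustrative)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # near-copy of "a"
    "c": rng.normal(size=200),                 # independent column
})

corr = demo.corr().abs()
threshold = 0.6
# keep only each pair once by walking the upper triangle of the matrix
high_pairs = [
    (c1, c2)
    for i, c1 in enumerate(corr.columns)
    for c2 in corr.columns[i + 1:]
    if corr.loc[c1, c2] > threshold
]
# high_pairs -> [("a", "b")]
```

For each flagged pair, one column is then dropped manually, as done in the next cell.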

In [88]:
# drop highly correlated columns as indicated in the section before
cols = ["NumberOfChildrenVisiting", "ProductPitched_num", "Designation_num"]
df.drop(axis=1, inplace=True, columns=cols)
In [89]:
df.columns
Out[89]:
Index(['ProdTaken', 'Age', 'CityTier', 'DurationOfPitch',
       'NumberOfPersonVisiting', 'NumberOfFollowups', 'PreferredPropertyStar',
       'NumberOfTrips', 'Passport', 'PitchSatisfactionScore', 'OwnCar',
       'MonthlyIncome', 'TypeofContact_num', 'Occupation_num', 'Gender_num',
       'MaritalStatus_num'],
      dtype='object')

Correlation Matrix after removing highly correlated features¶

In [90]:
plt.figure(figsize=(16, 10))
sns.heatmap(
    df[df.columns[1:]].corr(),
    annot=True,
    vmin=-1,
    vmax=1,
    fmt=".2f",
    cmap="Spectral",
)
plt.show()
  • Highest correlated value is 0.49 between 'MonthlyIncome' and 'Age'.

  • 'NumberOfFollowUps' and 'NumberOfPersonVisiting' are correlated up to 0.33.

  • 'MonthlyIncome' and 'NumberOfPersonVisiting' are correlated up to 0.22.

  • 'NumberOfTrips' and 'NumberOfPersonVisiting' are correlated at 0.19.

  • 'NumberOfTrips' and 'NumberOfFollowUps' are correlated at 0.14.

  • 'MonthlyIncome' and 'NumberOfTrips' are correlated at 0.13.

Age and ProdTaken¶

In [91]:
stacked_barplot(df, "Age", "ProdTaken")
  • The likelihood of a package being taken decreases as the customer's age increases.

CityTier and ProdTaken¶

In [92]:
stacked_barplot(df, "CityTier", "ProdTaken")
  • The likelihood of a package being taken is slightly lower for 'CityTier' = 1, while on the other tiers the chance is around 25%.

DurationOfPitch and ProdTaken¶

In [93]:
stacked_barplot(df, "DurationOfPitch", "ProdTaken")
  • There is high variability in the length of the pitch whether or not the customer buys a package.

NumberOfPersonVisiting and ProdTaken¶

In [94]:
stacked_barplot(df, "NumberOfPersonVisiting", "ProdTaken")
  • The chance of buying a package is higher when 'NumberOfPersonVisiting' is 2, 3, or 4.

NumberOfFollowups and ProdTaken¶

In [95]:
stacked_barplot(df, "NumberOfFollowups", "ProdTaken")
  • The chance of buying a package is higher than 20% when 'NumberOfFollowups' is 5 or 6.

PreferredPropertyStar and ProdTaken¶

In [96]:
stacked_barplot(df, "PreferredPropertyStar", "ProdTaken")
  • The chance of buying a package is higher than 20% when 'PreferredPropertyStar' is 4 or 5.

NumberOfTrips and ProdTaken¶

In [97]:
stacked_barplot(df, "NumberOfTrips", "ProdTaken")
  • The chance of buying a package is quite variable and may not depend heavily on 'NumberOfTrips'. However, 'NumberOfTrips' equal to 7 or 8 looks more promising than the others.

Passport and ProdTaken¶

In [98]:
stacked_barplot(df, "Passport", "ProdTaken")
  • The chance of buying a package is much higher when the customer has a 'Passport'.

PitchSatisfactionScore and ProdTaken¶

In [99]:
stacked_barplot(df, "PitchSatisfactionScore", "ProdTaken")
  • When the 'PitchSatisfactionScore' given is 3 or 5, the chance of buying a package is much higher.

OwnCar and ProdTaken¶

In [100]:
stacked_barplot(df, "OwnCar", "ProdTaken")
  • There is no significant difference in the chance of buying a package between customers who own a car and those who don't.

MonthlyIncome and ProdTaken¶

In [101]:
# stacked_barplot(df, "MonthlyIncome", "ProdTaken")
plt.figure(figsize=(10, 5))
sns.boxplot(x="ProdTaken", y="MonthlyIncome", data=df, showfliers=False)
plt.show()
  • The median 'MonthlyIncome' is lower when a customer buys a package.

TypeofContact_num and ProdTaken¶

In [102]:
stacked_barplot(df, "TypeofContact_num", "ProdTaken")
  • The chance of buying a package is similar whether 'TypeofContact' is 1 or 0.

Occupation_num and ProdTaken¶

In [103]:
stacked_barplot(df, "Occupation_num", "ProdTaken")
In [104]:
{v: k for k, v in Occupation_dict.items()}
Out[104]:
{0: 'Salaried', 1: 'Small Business', 2: 'Large Business', 3: 'Free Lancer'}
  • Chances of buying a package are very high when 'Occupation' is 'Free Lancer' and almost 30% when 'Occupation' is 'Large Business'. However, 'Free Lancer' is a very rare class.

Gender_num and ProdTaken¶

In [105]:
stacked_barplot(df, "Gender_num", "ProdTaken")
In [106]:
{v: k for k, v in Gender_dict.items()}
Out[106]:
{0: 'Male', 1: 'Female'}
  • The chance of buying a package is almost the same for 'Male' and 'Female', although it is slightly higher for 'Male'.

MaritalStatus_num and ProdTaken¶

In [107]:
stacked_barplot(df, "MaritalStatus_num", "ProdTaken")
In [108]:
{v: k for k, v in MaritalStatus_dict.items()}
Out[108]:
{0: 'Married', 1: 'Divorced', 2: 'Single', 3: 'Unmarried'}
  • The chance of buying a package is highest when 'MaritalStatus' is 'Single'.

  • The second-highest chance is for 'Unmarried'; 'Married' and 'Divorced' have a similar, lower chance.

Multivariate Analysis

  • 'Multivariate Analysis' is used to study more complex relationships in the data than 'Univariate Analysis' and 'Bivariate Analysis' methods can handle.
  • We will show a few combinations of features with 'ProdTaken'.

MonthlyIncome, NumberOfFollowups, and ProdTaken¶

In [109]:
sns.catplot(
    x="ProdTaken", y="MonthlyIncome", data=df, kind="bar", hue="NumberOfFollowups"
)
plt.xticks()
plt.show()
  • It seems the number of follow-ups is not a deciding factor in buying a package.

  • In cases where a package is not bought (ProdTaken = 0), higher salaries combined with 2, 4, 5, or 6 follow-ups are not a deciding factor.

  • However, when a package is bought, 2 or 6 follow-ups always coincide with a purchase.

  • In essence, there may be no connection between buying a package and the number of follow-ups, regardless of a higher salary.

MonthlyIncome, NumberOfPersonVisiting, and ProdTaken¶

In [110]:
sns.catplot(
    x="ProdTaken", y="MonthlyIncome", data=df, kind="bar", hue="NumberOfPersonVisiting"
)
plt.xticks()
plt.show()
  • A package is taken only in the cases where the number of visiting persons is 2, 3, or 4.

  • Since customers with the same number of visitors also decline, there may be another 'deciding' variable.

MonthlyIncome, NumberOfTrips, and ProdTaken¶

In [111]:
sns.catplot(x="ProdTaken", y="MonthlyIncome", data=df, kind="bar", hue="NumberOfTrips")
plt.xticks()
plt.show()
  • Although we have customers buying or not buying with the same number of trips, there is a higher variance when the customer decides to buy a package.

Key Observations¶

Data Cleaning, Feature Engineering and Data Imputation.¶

  • 'Gender' was cleaned by fixing the typo 'Fe male'.

  • We have eliminated the 'CustomerID' as it is not required.

  • In feature engineering we encoded the categorical variables, making them ready for model building.

  • 'Age', 'DurationOfPitch', 'NumberOfFollowups', 'PreferredPropertyStar', 'NumberOfTrips', 'NumberOfChildrenVisiting', and 'MonthlyIncome' are the columns with missing data.

  • We have imputed those columns using the well-known K-nearest-neighbours (KNN) algorithm.

Insights based on EDA

Customer Profile¶

  • A little over 81% of the customers have not applied for a package yet.

  • On 'Age' we have a right-skewed distribution with no visible outliers and a small hump to the left of the median, a signal of a possible bi-modal distribution.

  • On 'CityTier' the most frequent tier is 1 with 65%, followed by tier 3 at around 31%, and a small 4% at tier 2.

  • On 'DurationOfPitch' we have a right-skewed distribution (the outliers make the tail very long). We will remove these outlier observations as they may be exaggerated values for a sales pitch.

  • There are two observations with 'DurationOfPitch' higher than two hours. We will remove them.

  • We have removed observations containing highly extreme values.

  • For the missing data in the column 'DurationOfPitch' we will utilize the mode of 9 minutes.

  • Distribution for 'DurationOfPitch' is right-skewed with a second hump around the mean value, indicating a bi-modal distribution.

  • On 'NumberOfPersonVisiting' the most frequent group size is 3 people (49%), followed by 2 people (29%) and 4 people (21%).

  • On 'NumberOfFollowups' the most frequent number of follow-ups is 4 (~42%), followed by 3 (30%) and 5 (~16%). The remaining values occur rarely: 1 at 4%, 2 at 5%, and 6 at 3%.

  • On 'PreferredPropertyStar' the most frequent category is 3 stars (61%), followed by 5 stars (~20%) and 4 stars (19%).

  • On 'NumberOfTrips' the most frequent value is 2 trips (~30%), followed by 3 trips (22%), 1 trip (~13%), and 4, 5, 6, 7, and 8 trips at 10%, 10%, 7%, 5%, and 2% respectively.

  • However, we observe 'NumberOfTrips' values of 19, 20, 21, and 22, each at only 0.02%. We will remove these observations as they are likely outliers with an extremely low presence.

  • On 'Passport', almost 71% of customers don't have one; only 29% have a 'Passport'.

  • On 'PitchSatisfactionScore' the most frequent rating is 3, at 30%. Ratings 1, 4, and 5 occur at 19%, 19%, and 20% respectively.

  • The least frequent rating given by customers is 2, at 12%.

  • For the feature 'OwnCar', 62% own a car and 38% don't.

  • For the feature 'NumberOfChildrenVisiting', the most frequent number of children accompanying visitors is 1 (~43%), followed by 2 (27%) and 0 (22%); the least frequent is 3 (~7%).

  • The feature 'MonthlyIncome' shows an interesting right-skewed distribution with several humps side-by-side.

  • The box plot shows a few significant outliers around the median and a couple of 'extreme' values far away from it. It is a candidate for outlier analysis.

  • The 'MonthlyIncome' of the extreme outliers corresponds to Executive (Designation_num = 0) and Manager (Designation_num = 1).

  • Those values don't seem to be plausible figures for 'MonthlyIncome', so we will remove these observations.

  • Now the feature 'MonthlyIncome' shows a right-skewed distribution with a few humps side-by-side.

  • The box plot still shows some outliers; they are kept in this case as they may enrich the dataset insights.

  • On the feature 'Occupation_num', the class = '0' has a presence of 48% corresponding to 'Salaried' while class = '1' has almost 43% corresponding to 'Small Business'.

  • The class = '2'-'Large Business' has almost 9% frequency and the class = '3'-'Free Lancer' will be removed as it has only 0.04% presence and no relevance for this analysis.

  • On the feature 'Gender_num', the class = '0'-Male has a presence of 60%, while class = '1'-Female has 40%.

  • On the feature 'ProductPitched_num', the class = '0'-Basic has the most presence of 38%, followed by class = '1'-Deluxe with 36%.

  • Class '2'-Standard has 15%, class '3'-Super Deluxe 7%, and class '4'-King 5%.

  • Married represents 48%, followed by Divorced and Single at 19% each, and lastly Unmarried with 14%.

  • Executive and Manager represent 38% and 35% respectively of the customers requesting packages.

  • We have to make some decisions before continuing the EDA process with the bi-variate and multi-variate analysis.

  • The correlation plot above shows a couple of interesting relationships among pairs of features.

  • We will remove features that show a high correlation with another feature, keeping only one of each pair.

  • Starting with 'NumberOfChildrenVisiting', which has a +60% correlation with 'NumberOfPersonVisiting'; we will eliminate the former.

  • The feature 'MonthlyIncome' is positively correlated (~86%) with 'ProductPitched_num' and 'Designation_num'. We will eliminate the latter two.

  • We dropped the highly correlated columns 'NumberOfChildrenVisiting', 'ProductPitched_num', and 'Designation_num'.

  • The likelihood of a package being taken decreases as the customer's age increases.

  • The possibility of a package being taken is slightly lower for 'CityTier' = 1, while for the other tiers the chance is around 25%.

  • There is high variability in the length of the pitch both for customers who buy a package and for those who do not.

  • The chance of buying a package is higher when 'NumberOfPersonVisiting' is 2, 3, or 4.

  • The chance of buying a package is higher than 20% when 'NumberOfFollowups' is 5, or 6.

  • The chance of buying a package is higher than 20% when 'PreferredPropertyStar' is 4, or 5.

  • The chance of buying a package is quite variable and may not depend heavily on 'NumberOfTrips'. However, 'NumberOfTrips' equal to 7 or 8 looks more promising than others.

  • The chance of buying a package is much higher when the customer has a 'Passport'.

  • When the 'PitchSatisfactionScore' given is 3 or 5, the chance of buying a package is much higher.

  • There is no significant difference in the chance of buying a package whether or not the customer owns a car.

  • The median of 'MonthlyIncome' is lower in the case a customer is buying a package.

  • The chance of getting a package is not much different whether 'TypeOfContact' is 1 or 0.

  • Chances of getting a package are very high when 'Occupation' is 'Free Lancer', and almost 30% when 'Occupation' is 'Large Business'. However, 'Free Lancer' is rarely present in the data.

  • The chance of getting a package is almost the same whether 'Gender' is 'Male' or 'Female', although it is slightly higher for 'Male'.

  • The highest chance of getting a package is when 'MaritalStatus' is 'Single'.

  • The second highest chance is for 'Unmarried'; the chance is similar for 'Married' and 'Divorced'.

  • It seems the number of follow ups is not a deciding factor to buy a package.

  • In cases where a package is not bought (ProdTaken = 0), higher salaries combined with 2, 4, 5, or 6 follow-ups are not a deciding factor.

  • However, when a package is bought, 2 and 6 follow-ups always appear among the buyers.

  • In essence, there may not be a connection between buying a package and the number of follow-ups, regardless of salary.

  • A package can be taken by customers only in the cases where the number of visiting persons is 2, 3, or 4.

  • Although we can have customers not buying on the same number of people visiting, this may indicate we have another 'deciding' variable.

  • Although we have customers buying or not buying with the same number of trips, there is higher variance when the customer decides to buy a package.

Key meaningful observations on the relationship between variables¶

  • Highest correlated value is 0.49 between 'MonthlyIncome' and 'Age'.

  • 'NumberOfFollowUps' and 'NumberOfPersonVisiting' are correlated up to 0.33.

  • 'MonthlyIncome' and 'NumberOfPersonVisiting' are correlated up to 0.22.

  • 'NumberOfTrips' and 'NumberOfPersonVisiting' are correlated at 0.19.

  • 'NumberOfTrips' and 'NumberOfFollowUps' are correlated at 0.14.

  • 'MonthlyIncome' and 'NumberOfTrips' are correlated at 0.13.
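The pairwise figures above come from a standard correlation matrix; a minimal sketch with toy data (the numbers below are illustrative, not the project's):

```python
import pandas as pd

# Toy data standing in for the cleaned dataset (values are illustrative)
df = pd.DataFrame({
    "Age": [25, 32, 40, 51, 46, 38],
    "MonthlyIncome": [18000, 21000, 25000, 31000, 28000, 24000],
    "NumberOfTrips": [1, 2, 3, 5, 4, 3],
})

# Pearson correlations between every pair of numeric columns
corr = df.corr()
print(corr.loc["MonthlyIncome", "Age"])
```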

Model Building

Split the data into train and test sets¶

  • When a classification dataset exhibits a significant imbalance in the distribution of the target classes, it is good to use stratified sampling to ensure that relative class frequencies are approximately preserved in the train and test sets.
  • This is done by setting the stratify parameter to the target variable in the train_test_split function.
In [113]:
X = df.drop("ProdTaken", axis=1)
y = df.pop("ProdTaken")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
(3414, 15) (1464, 15)
In [114]:
y.value_counts(True)
Out[114]:
0    0.811808
1    0.188192
Name: ProdTaken, dtype: float64
In [115]:
y_test.value_counts(True)
Out[115]:
0    0.811475
1    0.188525
Name: ProdTaken, dtype: float64

Handy functions¶

In [116]:
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual):
    """
    model : classifier used to predict on the (global) X_test
    y_actual : ground truth labels for X_test

    """
    y_predict = model.predict(X_test)
    cm = confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(
        cm,
        index=[i for i in ["Actual - No", "Actual - Yes"]],
        columns=[i for i in ["Predicted - No", "Predicted - Yes"]],
    )
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")


##  Function to calculate different metric scores of the model - Accuracy, Recall and Precision
def get_metrics_score(model, flag=True):
    """
    model : classifier to predict values of X

    """
    # defining an empty list to store train and test results
    score_list = []

    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)

    train_recall = recall_score(y_train, pred_train)
    test_recall = recall_score(y_test, pred_test)

    train_precision = precision_score(y_train, pred_train)
    test_precision = precision_score(y_test, pred_test)

    score_list.extend(
        (
            train_acc,
            test_acc,
            train_recall,
            test_recall,
            train_precision,
            test_precision,
        )
    )

    # The following print statements are displayed only when flag is True (the default).
    if flag:
        print("Accuracy on training set : ", model.score(X_train, y_train))
        print("Accuracy on test set : ", model.score(X_test, y_test))
        print("Recall on training set : ", recall_score(y_train, pred_train))
        print("Recall on test set : ", recall_score(y_test, pred_test))
        print("Precision on training set : ", precision_score(y_train, pred_train))
        print("Precision on test set : ", precision_score(y_test, pred_test))

    return score_list  # returning the list with train and test scores


# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf

Building the 'Bagging' Models

  • We are going to build 3 ensemble bagging models here - 'Bagging Classifier', 'Decision Tree Classifier' and 'Random Forest Classifier'.
  • First, let's build these models with default parameters and then use hyperparameter tuning to optimize the model performance.
  • We will calculate all three metrics - Accuracy, Precision and Recall but the metric of interest here is Recall.
  • Recall - It gives the ratio of True positives to Actual positives, so high Recall implies low false negatives, i.e. low chances of predicting a 'buyer' of a package as 'non buyer'.
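As a quick illustration of the Recall definition above, a small sketch with made-up labels and predictions (not output from any model in this notebook):

```python
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth and predictions: 1 = buyer, 0 = non-buyer
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# Recall = TP / (TP + FN): 2 of the 4 actual buyers were caught
print(recall_score(y_true, y_pred))  # 0.5
# Precision = TP / (TP + FP): 2 of the 3 predicted buyers were real buyers
print(precision_score(y_true, y_pred))
```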

Bagging Classifier¶

In [117]:
bagging = BaggingClassifier(random_state=1)
bagging.fit(X_train, y_train)
Out[117]:
BaggingClassifier(random_state=1)
In [118]:
confusion_matrix_sklearn(bagging, X_test, y_test)
In [119]:
bagging_model_train_perf = model_performance_classification_sklearn(
    bagging, X_train, y_train
)
print("Training performance \n", bagging_model_train_perf)
Training performance 
    Accuracy    Recall  Precision        F1
0   0.99297  0.962617        1.0  0.980952
In [120]:
bagging_model_test_perf = model_performance_classification_sklearn(
    bagging, X_test, y_test
)
print("Testing performance \n", bagging_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.894809  0.528986   0.858824  0.654709
  • Bagging classifier is overfitting on the training set and is performing poorly on the test set in terms of recall.

Bagging Classifier with weighted decision tree

In [121]:
bagging_wt = BaggingClassifier(
    base_estimator=DecisionTreeClassifier(
        criterion="gini", class_weight={0: 0.81, 1: 0.19}, random_state=1
    ),
    random_state=1,
)
bagging_wt.fit(X_train, y_train)
Out[121]:
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight={0: 0.81,
                                                                      1: 0.19},
                                                        random_state=1),
                  random_state=1)
In [122]:
confusion_matrix_sklearn(bagging_wt, X_test, y_test)
In [123]:
bagging_wt_model_train_perf = model_performance_classification_sklearn(
    bagging_wt, X_train, y_train
)
print("Training performance \n", bagging_wt_model_train_perf)
Training performance 
    Accuracy    Recall  Precision        F1
0  0.994142  0.971963   0.996805  0.984227
In [124]:
bagging_wt_model_test_perf = model_performance_classification_sklearn(
    bagging_wt, X_test, y_test
)
print("Testing performance \n", bagging_wt_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.901639  0.615942   0.817308  0.702479
  • Bagging classifier with a weighted decision tree is giving very good accuracy and prediction but is not able to generalize well on test data in terms of recall.

Decision Tree Classifier¶

  • We will build our model using the DecisionTreeClassifier function, with the default 'gini' criterion for splitting.
  • If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree grows biased toward it.

  • In this case, we can pass a dictionary {0:0.81,1:0.19} to the model to specify the weight of each class and the decision tree will give more weightage to class 0.

  • class_weight is a hyperparameter for the decision tree classifier.

In [125]:
dtree = DecisionTreeClassifier(
    criterion="gini", class_weight={0: 0.81, 1: 0.19}, random_state=1
)
In [126]:
dtree.fit(X_train, y_train)
Out[126]:
DecisionTreeClassifier(class_weight={0: 0.81, 1: 0.19}, random_state=1)
In [127]:
confusion_matrix_sklearn(dtree, X_test, y_test)

Confusion Matrix -

  • The customer bought a package and the model predicted it correctly : True Positive (observed=1, predicted=1)

  • The customer didn't buy a package but the model predicted they would : False Positive (observed=0, predicted=1)

  • The customer didn't buy a package and the model predicted they wouldn't : True Negative (observed=0, predicted=0)

  • The customer bought a package but the model predicted they wouldn't : False Negative (observed=1, predicted=0)

In [128]:
dtree_model_train_perf = model_performance_classification_sklearn(
    dtree, X_train, y_train
)
print("Training performance \n", dtree_model_train_perf)
Training performance 
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
In [129]:
dtree_model_test_perf = model_performance_classification_sklearn(dtree, X_test, y_test)
print("Testing performance \n", dtree_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.889344  0.724638   0.699301  0.711744
  • The decision tree fits the training data perfectly but is not able to generalize well on the test data in terms of recall.

Random Forest¶

In [130]:
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
Out[130]:
RandomForestClassifier(random_state=1)
In [131]:
confusion_matrix_sklearn(rf, X_test, y_test)
In [132]:
rf_model_train_perf = model_performance_classification_sklearn(rf, X_train, y_train)
print("Training performance \n", rf_model_train_perf)
Training performance 
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
In [133]:
rf_model_test_perf = model_performance_classification_sklearn(rf, X_test, y_test)
print("Testing performance \n", rf_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.910519  0.572464   0.923977  0.706935
  • Random Forest has performed well in terms of accuracy and precision, but it is not able to generalize well on the test data in terms of recall.

Random forest with class weights

In [134]:
rf_wt = RandomForestClassifier(class_weight={0: 0.81, 1: 0.19}, random_state=1)
rf_wt.fit(X_train, y_train)
Out[134]:
RandomForestClassifier(class_weight={0: 0.81, 1: 0.19}, random_state=1)
In [135]:
confusion_matrix_sklearn(rf_wt, X_test, y_test)
In [136]:
rf_wt_model_train_perf = model_performance_classification_sklearn(
    rf_wt, X_train, y_train
)
print("Training performance \n", rf_wt_model_train_perf)
Training performance 
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
In [137]:
rf_wt_model_test_perf = model_performance_classification_sklearn(rf_wt, X_test, y_test)
print("Testing performance \n", rf_wt_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.913934  0.597826   0.916667  0.723684
  • There is not much improvement in metrics of weighted random forest as compared to the unweighted random forest.

Hyperparameter Tuning on Bagging Models

Using GridSearch for Hyperparameter tuning model¶

  • Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will reduce the loss of the model, so we usually resort to experimentation, i.e., grid search.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search performed over the specified parameter values of a model.
  • The parameters of the estimator/model are optimized by cross-validated grid search over a parameter grid.

Tuning Bagging Classifier

In [138]:
# grid search for bagging classifier
cl1 = DecisionTreeClassifier(random_state=1)
param_grid = {
    "base_estimator": [cl1],
    "n_estimators": [5, 7, 15, 51, 101],
    "max_features": [0.7, 0.8, 0.9, 1],
}

grid = GridSearchCV(
    BaggingClassifier(random_state=1, bootstrap=True),
    param_grid=param_grid,
    scoring="recall",
    cv=5,
)
grid.fit(X_train, y_train)
Out[138]:
GridSearchCV(cv=5, estimator=BaggingClassifier(random_state=1),
             param_grid={'base_estimator': [DecisionTreeClassifier(random_state=1)],
                         'max_features': [0.7, 0.8, 0.9, 1],
                         'n_estimators': [5, 7, 15, 51, 101]},
             scoring='recall')
In [139]:
## getting the best estimator
bagging_estimator = grid.best_estimator_
bagging_estimator.fit(X_train, y_train)
Out[139]:
BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
                  max_features=0.9, n_estimators=101, random_state=1)
In [140]:
confusion_matrix_sklearn(bagging_estimator, X_test, y_test)
In [141]:
bagging_estimator_model_train_perf = model_performance_classification_sklearn(
    bagging_estimator, X_train, y_train
)
print("Training performance \n", bagging_estimator_model_train_perf)
Training performance 
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
In [142]:
bagging_estimator_model_test_perf = model_performance_classification_sklearn(
    bagging_estimator, X_test, y_test
)
print("Testing performance \n", bagging_estimator_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.914617  0.630435   0.883249  0.735729
  • Recall has improved over the default bagging classifier, and accuracy and precision have improved slightly as well; however, the perfect training scores indicate the model is still overfitting the training data.

Tuning Decision Tree

In [143]:
# Choose the type of classifier.
dtree_estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 30),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10, 15, None],
    "min_impurity_decrease": [0.0001, 0.001, 0.01, 0.1],
}

# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_estimator, parameters, scoring=scorer)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
dtree_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_estimator.fit(X_train, y_train)
Out[143]:
DecisionTreeClassifier(max_depth=19, min_impurity_decrease=0.0001,
                       random_state=1)
In [144]:
confusion_matrix_sklearn(dtree_estimator, X_test, y_test)
In [145]:
dtree_estimator_model_train_perf = model_performance_classification_sklearn(
    dtree_estimator, X_train, y_train
)
print("Training performance \n", dtree_estimator_model_train_perf)
Training performance 
    Accuracy    Recall  Precision        F1
0  0.988576  0.943925   0.995074  0.968825
In [146]:
dtree_estimator_model_test_perf = model_performance_classification_sklearn(
    dtree_estimator, X_test, y_test
)
print("Testing performance \n", dtree_estimator_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.881148  0.648551   0.699219  0.672932
  • Overfitting in the decision tree has reduced, but the recall has also dropped.

Tuning Random Forest Classifier

In [147]:
# Choose the type of classifier.
rf_estimator = RandomForestClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [110, 251, 501],
    "min_samples_leaf": np.arange(1, 6, 1),
    "max_features": [0.7, 0.9, "log2", "auto"],
    "max_samples": [0.7, 0.9, None],
}

# Run the grid search
grid_obj = GridSearchCV(rf_estimator, parameters, scoring="recall", cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
rf_estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_estimator.fit(X_train, y_train)
Out[147]:
RandomForestClassifier(max_features=0.9, n_estimators=501, random_state=1)
In [148]:
confusion_matrix_sklearn(rf_estimator, X_test, y_test)
In [149]:
rf_estimator_model_train_perf = model_performance_classification_sklearn(
    rf_estimator, X_train, y_train
)
print("Training performance \n", rf_estimator_model_train_perf)
Training performance 
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
In [150]:
rf_estimator_model_test_perf = model_performance_classification_sklearn(
    rf_estimator, X_test, y_test
)
print("Testing performance \n", rf_estimator_model_test_perf)
Testing performance 
    Accuracy    Recall  Precision        F1
0  0.919399  0.648551      0.895  0.752101
  • After tuning, the random forest shows a modest improvement in test recall (0.65 vs 0.57 for the un-tuned model) but still overfits the training data.

Building the 'Boosting' Models

  • We are going to build 3 ensemble boosting models here - 'AdaBoost Classifier', 'Gradient Boosting Classifier' and 'XGBoost Classifier'.
  • First, let's build these models with default parameters and then use hyperparameter tuning to optimize the model performance.
  • We will calculate all four metrics - Accuracy, Precision, Recall, and F1 Score but the metric of interest here is Recall.
  • Recall - It gives the ratio of True positives to Actual positives, so high Recall implies low false negatives, i.e. low chances of predicting a 'buyer' of a package as 'non buyer'.

AdaBoost Classifier

In [151]:
abc = AdaBoostClassifier(random_state=1)
abc.fit(X_train, y_train)
Out[151]:
AdaBoostClassifier(random_state=1)
In [152]:
# Using above defined function to get accuracy, recall and precision on train and test set
abc_score = get_metrics_score(abc)
Accuracy on training set :  0.8456356180433509
Accuracy on test set :  0.8360655737704918
Recall on training set :  0.3442367601246106
Recall on test set :  0.322463768115942
Precision on training set :  0.6758409785932722
Precision on test set :  0.6267605633802817
In [153]:
# Plot the confusion matrix
make_confusion_matrix(abc, y_test)

Gradient Boosting Classifier

In [154]:
gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X_train, y_train)
Out[154]:
GradientBoostingClassifier(random_state=1)
In [155]:
# Using above defined function to get accuracy, recall and precision on train and test set
gbc_score = get_metrics_score(gbc)
Accuracy on training set :  0.8863503222026948
Accuracy on test set :  0.8545081967213115
Recall on training set :  0.4672897196261682
Recall on test set :  0.3695652173913043
Precision on training set :  0.8670520231213873
Precision on test set :  0.723404255319149
In [156]:
# Plot the confusion matrix
make_confusion_matrix(gbc, y_test)

XGBoost Classifier

In [157]:
xgb = XGBClassifier(random_state=1, eval_metric="logloss")
xgb.fit(X_train, y_train)
Out[157]:
XGBClassifier(eval_metric='logloss', random_state=1)
In [158]:
# Using above defined function to get accuracy, recall and precision on train and test set
xgb_score = get_metrics_score(xgb)
Accuracy on training set :  0.880199179847686
Accuracy on test set :  0.85724043715847
Recall on training set :  0.43457943925233644
Recall on test set :  0.35507246376811596
Precision on training set :  0.8584615384615385
Precision on test set :  0.7596899224806202
In [159]:
make_confusion_matrix(xgb, y_test)

With default parameters:

  • XGBoost classifier has the best test accuracy among these 3 models, while GB classifier has the best test recall.
  • AdaBoost classifier has the lowest test accuracy and test recall.

Hyperparameter Tuning on Boosting Models

Tuning AdaBoost Classifier

  • An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.
  • Some important hyperparameters are:
    • base_estimator: The base estimator from which the boosted ensemble is built. By default the base estimator is a decision tree with max_depth=1
    • n_estimators: The maximum number of estimators at which boosting is terminated. Default value is 50.
    • learning_rate: Learning rate shrinks the contribution of each classifier by learning_rate. There is a trade-off between learning_rate and n_estimators.
In [160]:
# Choose the type of classifier.
abc_tuned = AdaBoostClassifier(random_state=1)
In [161]:
# Grid of parameters to choose from
parameters = {
    # Let's try different max_depth for base_estimator
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
    "n_estimators": np.arange(10, 110, 10),
    "learning_rate": np.arange(0.1, 2, 0.1),
}
In [162]:
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
In [163]:
# Run the grid search
grid_obj = GridSearchCV(abc_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
In [164]:
# Set the clf to the best combination of parameters
abc_tuned = grid_obj.best_estimator_
In [165]:
# Fit the best algorithm to the data.
abc_tuned.fit(X_train, y_train)
Out[165]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=1.6, n_estimators=100, random_state=1)
In [166]:
# Using above defined function to get accuracy, recall and precision on train and test set
abc_tuned_score = get_metrics_score(abc_tuned)
Accuracy on training set :  0.986233157586409
Accuracy on test set :  0.8442622950819673
Recall on training set :  0.942367601246106
Recall on test set :  0.5434782608695652
Precision on training set :  0.983739837398374
Precision on test set :  0.5952380952380952
In [167]:
make_confusion_matrix(abc_tuned, y_test)

Insights¶

  • The model is overfitting the train data, as train accuracy is much higher than test accuracy.
  • The model has relatively low test recall, which implies it misses a fair share of potential buyers.
In [168]:
importances = abc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
In [169]:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • 'MonthlyIncome' is the most important feature as per the tuned AdaBoost model.

Tuning Gradient Boosting Classifier

  • Most of the available hyperparameters are the same as for the random forest classifier.
  • init: An estimator object that is used to compute the initial predictions. If ‘zero’, the initial raw predictions are set to zero. By default, a DummyEstimator predicting the classes priors is used.
  • There is no class_weights parameter in gradient boosting.
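Since GradientBoostingClassifier has no class_weight parameter, one workaround (a sketch only, not the approach taken below) is to pass per-row sample_weight to fit, computed with sklearn's compute_sample_weight; the toy data here stands in for X_train / y_train:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Toy imbalanced data standing in for X_train / y_train (~81% / ~19%)
X, y = make_classification(n_samples=500, weights=[0.81, 0.19], random_state=1)

# 'balanced' gives minority-class rows proportionally larger weights
weights = compute_sample_weight(class_weight="balanced", y=y)

gbc = GradientBoostingClassifier(random_state=1)
gbc.fit(X, y, sample_weight=weights)
print(gbc.score(X, y))
```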

Let's try using AdaBoost classifier as the estimator for initial predictions

In [170]:
gbc_init = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=1), random_state=1
)
gbc_init.fit(X_train, y_train)
Out[170]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           random_state=1)
In [171]:
# Using above defined function to get accuracy, recall and precision on train and test set
gbc_init_score = get_metrics_score(gbc_init)
Accuracy on training set :  0.8866432337434095
Accuracy on test set :  0.8545081967213115
Recall on training set :  0.46417445482866043
Recall on test set :  0.36594202898550726
Precision on training set :  0.873900293255132
Precision on test set :  0.7266187050359713

As compared to the model with default parameters:

  • Test accuracy and test recall have increased slightly.
  • As we are getting better results, we will use init = AdaBoostClassifier() to tune the gradient boosting model.
In [172]:
# Choose the type of classifier.
gbc_tuned = GradientBoostingClassifier(
    init=AdaBoostClassifier(random_state=1), random_state=1
)
In [173]:
# Grid of parameters to choose from
parameters = {
    "n_estimators": [100, 150, 200, 250],
    "subsample": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1],
}
In [174]:
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
In [175]:
# Run the grid search
grid_obj = GridSearchCV(gbc_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
In [176]:
# Set the clf to the best combination of parameters
gbc_tuned = grid_obj.best_estimator_
In [177]:
# Fit the best algorithm to the data.
gbc_tuned.fit(X_train, y_train)
Out[177]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.8, n_estimators=250, random_state=1,
                           subsample=0.9)
In [178]:
# Using above defined function to get accuracy, recall and precision on train and test set
gbc_tuned_score = get_metrics_score(gbc_tuned)
Accuracy on training set :  0.9212067955477445
Accuracy on test set :  0.869535519125683
Recall on training set :  0.618380062305296
Recall on test set :  0.44565217391304346
Precision on training set :  0.9429928741092637
Precision on test set :  0.7639751552795031
In [179]:
make_confusion_matrix(gbc_tuned, y_test)

Insights¶

  • The model performance has not increased by much.
  • The model has started to overfit the train data in terms of recall.
  • It is better at identifying non-buyers than buyers, which is the opposite of the result we need.
In [180]:
importances = gbc_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)
In [181]:
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • 'MonthlyIncome' is the most important feature, followed by 'Age', 'Passport' and 'DurationOfPitch', as per the tuned gradient boosting model

Tuning XGBoost Classifier

XGBoost has many hyperparameters which can be tuned to increase the model performance; see the XGBoost documentation for details. Some of the important parameters are:

  • scale_pos_weight: Controls the balance of positive and negative weights, useful for unbalanced classes. Its range is 0 to $\infty$.
  • subsample: Corresponds to the fraction of observations (the rows) to subsample at each step. By default it is set to 1 meaning that we use all rows.
  • colsample_bytree: Corresponds to the fraction of features (the columns) to use.
  • colsample_bylevel: The subsample ratio of columns for each level. Columns are subsampled from the set of columns chosen for the current tree.
  • colsample_bynode: The subsample ratio of columns for each node (split). Columns are subsampled from the set of columns chosen for the current level.
  • max_depth: The maximum depth of a tree, i.e., the longest path allowed from the root to a leaf.
  • learning_rate/eta: Makes the model more robust by shrinking the weights on each step.
  • gamma: A node is split only when the resulting split gives a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split.
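A common starting point for scale_pos_weight (a rule of thumb from the XGBoost docs, not a value used in this notebook) is the ratio of negative to positive training examples:

```python
import numpy as np

# Toy target with the same imbalance as this project (~81% / ~19%)
y_train = np.array([0] * 81 + [1] * 19)

neg, pos = np.bincount(y_train)
# Rule of thumb: sum(negative instances) / sum(positive instances)
spw = neg / pos
print(round(spw, 2))  # 4.26
```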
In [182]:
# Choose the type of classifier.
xgb_tuned = XGBClassifier(random_state=1, eval_metric="logloss")

# Grid of parameters to choose from
parameters = {
    "n_estimators": np.arange(10, 100, 20),
    "scale_pos_weight": [0, 1, 2, 5],
    "subsample": [0.5, 0.7, 0.9, 1],
    "learning_rate": [0.01, 0.1, 0.2, 0.05],
    "gamma": [0, 1, 3],
    "colsample_bytree": [0.5, 0.7, 0.9, 1],
    "colsample_bylevel": [0.5, 0.7, 0.9, 1],
}
In [183]:
# Type of scoring used to compare parameter combinations
acc_scorer = make_scorer(recall_score)
In [184]:
# Run the grid search
grid_obj = GridSearchCV(xgb_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
In [185]:
# Set the clf to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_
In [186]:
# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
Out[186]:
XGBClassifier(colsample_bylevel=0.5, colsample_bytree=0.5,
              eval_metric='logloss', learning_rate=0.01, n_estimators=30,
              random_state=1, scale_pos_weight=5, subsample=0.9)
In [187]:
# Using above defined function to get accuracy, recall and precision on train and test set
xgb_tuned_score = get_metrics_score(xgb_tuned)
Accuracy on training set :  0.6994727592267135
Accuracy on test set :  0.6653005464480874
Recall on training set :  0.7741433021806854
Recall on test set :  0.7681159420289855
Precision on training set :  0.3606676342525399
Precision on test set :  0.3322884012539185
In [188]:
make_confusion_matrix(xgb_tuned, y_test)

Insights¶

  • The test accuracy of the model has dropped compared to the model with default parameters, but recall has increased significantly: the model now identifies most of the potential buyers.
  • Decreasing the number of false negatives has increased the number of false positives here.
  • The tuned model is not overfitting and generalizes well.
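As a sanity check, the printed test metrics are mutually consistent with one confusion matrix. Assuming 732 test rows of which 138 are actual buyers (counts inferred here from the metrics, not taken from the notebook), the arithmetic works out:

```python
# Consistency check (illustrative): the test metrics above match a confusion
# matrix of roughly TP=106, FN=32, FP=213, TN=381, assuming 732 test rows
# with 138 actual buyers (counts inferred, not from the notebook output).
tp, fn, fp, tn = 106, 32, 213, 381

accuracy = (tp + tn) / (tp + fn + fp + tn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(round(accuracy, 4), round(recall, 4), round(precision, 4))
# -> 0.6653 0.7681 0.3323
```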
In [189]:
importances = xgb_tuned.feature_importances_
indices = np.argsort(importances)
feature_names = list(X.columns)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
  • 'Passport' is the most important feature per the XGBoost model, unlike AdaBoost and Gradient Boosting, where the most important feature is 'MonthlyIncome'.
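The ordering in the chart comes from `np.argsort`, which returns indices in ascending order of importance, so `barh` draws the most important feature at the top. A tiny sketch with made-up importances:

```python
# How the feature-importance plot orders its bars (made-up values):
# np.argsort returns indices from least to most important.
import numpy as np

importances = np.array([0.10, 0.45, 0.05, 0.40])  # hypothetical importances
indices = np.argsort(importances)
print(indices)  # [2 0 3 1] -- least to most important
```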

Model performance comparison on various metrics

In [190]:
# defining list of models
models = {
    # -------- bagging models --------
    "bagging": bagging,  # bagging
    "bagging weighted": bagging_wt,  # bagging weighted
    "decision tree": dtree,  # decision tree
    "random forest": rf,  # random forest
    "random forest weighted": rf_wt,  # random forest weighted
    "bagging tuned": bagging_estimator,  # bagging tuned
    "decision tree tuned": dtree_estimator,  # decision tree tuned
    "random forest tuned": rf_estimator,  # random forest tuned
    # -------- boosting models --------
    "adaboost with default parameters": abc,  # adaboost with default parameters
    "adaboost tuned": abc_tuned,  # adaboost tuned
    "gradient boosting with default parameters": gbc,  # gradient boosting with default parameters
    "gradient boosting with init=adaboost": gbc_init,  # gradient boosting with init=AdaBoost
    "gradient boosting tuned": gbc_tuned,  # gradient boosting tuned
    "xgboost with default parameters": xgb,  # xgboost with default parameters
    "xgboost tuned": xgb_tuned,  # xgboost tuned
}
In [191]:
# consolidate train and test metrics for all bagging and boosting models
df_models = pd.DataFrame()
for model_id, model in models.items():
    parts = []
    for split, X_, y_ in [("train", X_train, y_train), ("test", X_test, y_test)]:
        df_split = np.round(
            model_performance_classification_sklearn(model, X_, y_), 2
        )
        parts.append(
            pd.concat(
                [pd.DataFrame([split], index=[0], columns=["split"]), df_split],
                axis=1,
            )
        )
    # train and test results side by side, one row per model
    df_concat = pd.concat(parts, axis=1)
    df_concat.index = [model_id]
    df_models = pd.concat([df_models, df_concat], axis=0)
df_models
Out[191]:
split Accuracy Recall Precision F1 split Accuracy Recall Precision F1
bagging train 0.99 0.96 1.00 0.98 test 0.89 0.53 0.86 0.65
bagging weighted train 0.99 0.97 1.00 0.98 test 0.90 0.62 0.82 0.70
decision tree train 1.00 1.00 1.00 1.00 test 0.89 0.72 0.70 0.71
random forest train 1.00 1.00 1.00 1.00 test 0.91 0.57 0.92 0.71
random forest weighted train 1.00 1.00 1.00 1.00 test 0.91 0.60 0.92 0.72
bagging tuned train 1.00 1.00 1.00 1.00 test 0.91 0.63 0.88 0.74
decision tree tuned train 0.99 0.94 1.00 0.97 test 0.88 0.65 0.70 0.67
random forest tuned train 1.00 1.00 1.00 1.00 test 0.92 0.65 0.90 0.75
adaboost with default parameters train 0.85 0.34 0.68 0.46 test 0.84 0.32 0.63 0.43
adaboost tuned train 0.99 0.94 0.98 0.96 test 0.84 0.54 0.60 0.57
gradient boosting with default parameters train 0.89 0.47 0.87 0.61 test 0.85 0.37 0.72 0.49
gradient boosting with init=adaboost train 0.89 0.46 0.87 0.61 test 0.85 0.37 0.73 0.49
gradient boosting tuned train 0.92 0.62 0.94 0.75 test 0.87 0.45 0.76 0.56
xgboost with default parameters train 0.88 0.43 0.86 0.58 test 0.86 0.36 0.76 0.48
xgboost tuned train 0.70 0.77 0.36 0.49 test 0.67 0.77 0.33 0.46

Remarks

  • A Cost Function quantifies the error between predicted values and expected values and presents it in the form of a single real number.
  • Company "Visit for Us"'s main aim is to balance the trade-off between losing a sales opportunity in the case of an FN (a would-be buyer who is never pitched) and wasting pitch time and money in the case of an FP.
  • We emphasized that 'Recall' is the metric of interest here and tuned our models on it, but this does not mean the other metrics should be ignored entirely.
  • Here we assumed that $cost\ of\ FN > cost\ of\ FP$; however, we do not want to misclassify so many non-buyers that the inequality reverses, i.e., $cost\ of\ FP > cost\ of\ FN$, in which case Company "Visit for Us" would actually lose money in the long run.
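This trade-off can be made concrete with an expected-cost calculation. The unit costs and error counts below are assumed for illustration, not taken from the notebook; under them, the recall-tuned error profile is cheaper overall despite its many false positives:

```python
# Illustrative cost comparison (all numbers assumed, not from the notebook):
# if a missed buyer (FN) costs 10 units and a wasted pitch (FP) costs 1, a
# recall-heavy model can be cheaper overall despite far more false positives.
cost_fn, cost_fp = 10, 1                      # hypothetical unit costs, FN > FP

def expected_cost(fn, fp):
    return cost_fn * fn + cost_fp * fp

high_recall = expected_cost(fn=32, fp=213)    # recall-heavy error profile
high_precision = expected_cost(fn=64, fp=12)  # precision-heavy error profile
print(high_recall, high_precision)            # 533 652
```

If the FP cost rose enough to reverse the inequality, the same calculation would favor the precision-heavy profile, which is exactly the risk the remark above warns about.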

Business Insights - Recommendations

  • From the consolidated table of results, the highest test 'Accuracy' comes from random forest, random forest weighted, and bagging tuned, all at 0.91, and from random forest tuned at 0.92.

  • The highest test 'Recall' values come from decision tree at 0.72 and xgboost tuned at 0.77; their respective training values are 1.00 and 0.77.

  • Although these values look a little low for the problem at hand (at least on testing), we believe they are the more stable values among these two models, and xgboost tuned tends to generalize better than decision tree.

  • On feature importance, 'Passport' is the most important feature in the xgboost tuned model, followed by 'MaritalStatus', 'Age', and 'MonthlyIncome'. In Gradient Boosting, by contrast, the most important feature is 'MonthlyIncome', followed by 'Age', 'Passport', and 'DurationOfPitch'; in AdaBoost, the order from highest to lowest is 'MonthlyIncome', 'DurationOfPitch', 'Age', and 'NumberOfTrips'.

  • We were able to build an xgboost tuned model that generalizes well by optimizing the Recall metric with cross-validation.

  • We recommend focusing on customers who meet the criteria around 'Passport', 'MaritalStatus', 'Age', and 'MonthlyIncome', as indicated by the xgboost tuned model. Targeting along these features supports the model's 'reasonable' results, which keep the number of $False\ Negatives$ low.

Generate HTML version of the Python notebook

In [194]:
# !jupyter nbconvert --to html --template full Project4_Travel_Package_Purchase_Prediction.ipynb